Python Web Scraping by Katharine Jarmul

Author: Katharine Jarmul
Language: English
Format: epub
Publisher: Packt Publishing
Published: 2017-05-29T09:14:54+00:00


def process_queue():
    while len(crawl_queue):
        url = crawl_queue.pop()
        ...

The first change is replacing our Python list with the new Redis-based queue, named RedisQueue. This queue handles duplicate URLs internally, so the seen variable is no longer required. Finally, len() is called on the RedisQueue to determine whether there are still URLs in the queue. Further logic changes to handle the depth and seen functionality are shown here:

## inside process_queue
if no_robots or rp.can_fetch(user_agent, url):
    depth = crawl_queue.get_depth(url) or 0
    if depth == max_depth:
        print('Skipping %s due to depth' % url)
        continue
    html = D(url, num_retries=num_retries)
    if not html:
        continue
    if scraper_callback:
        links = scraper_callback(url, html) or []
    else:
        links = []
    # filter for links matching our regular expression
    for link in get_links(html, link_regex) + links:
        if 'http' not in link:
            link = clean_link(url, domain, link)
        crawl_queue.push(link)
        crawl_queue.set_depth(link, depth + 1)

The full code can be seen at http://github.com/kjam/wswp/blob/master/code/chp4/threaded_crawler_with_queue.py.
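For context, here is a minimal sketch of what a Redis-backed queue exposing the interface used above (push, pop, len(), get_depth, set_depth) might look like, built on the redis-py client. This is an illustrative assumption rather than the book's implementation; the real RedisQueue is in the linked repository:

import redis

class RedisQueue:
    """Sketch of a Redis-backed URL queue with duplicate filtering and depth tracking."""
    def __init__(self, client=None, db=0, queue_name='wswp'):
        # connect to a local Redis server by default
        self.client = client or redis.StrictRedis(host='localhost', port=6379, db=db)
        self.name = 'queue:%s' % queue_name
        self.seen_set = 'seen:%s' % queue_name
        self.depth_hash = 'depth:%s' % queue_name

    def __len__(self):
        # lets the crawler test "while len(crawl_queue):"
        return self.client.llen(self.name)

    def push(self, url):
        # only enqueue URLs we have not seen before (replaces the old seen variable)
        if not self.client.sismember(self.seen_set, url):
            self.client.lpush(self.name, url)
            self.client.sadd(self.seen_set, url)

    def pop(self):
        # return the next URL, or None if the queue is empty
        url = self.client.rpop(self.name)
        return url.decode('utf-8') if url else None

    def get_depth(self, url):
        depth = self.client.hget(self.depth_hash, url)
        return int(depth) if depth else 0

    def set_depth(self, url, depth):
        self.client.hset(self.depth_hash, url, depth)

Because the seen set and the depth bookkeeping live in Redis rather than in the process's memory, any number of threads or processes pointed at the same Redis instance share one consistent URL frontier.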

This updated version of the threaded crawler can then be started using multiple processes with this snippet:

import multiprocessing
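As a rough sketch of such a launcher (not the book's code; the helper name mp_crawler and the num_procs keyword are hypothetical here, and the real version lives in the repository linked above), the crawler function can be handed to multiprocessing.Process as the target:

import multiprocessing

def mp_crawler(crawler, *args, **kwargs):
    # mp_crawler is a hypothetical helper: it runs the given crawler function
    # in one process per CPU core, or in num_procs processes if provided
    num_procs = kwargs.pop('num_procs', None) or multiprocessing.cpu_count()
    processes = []
    for _ in range(num_procs):
        proc = multiprocessing.Process(target=crawler, args=args, kwargs=kwargs)
        proc.start()
        processes.append(proc)
    # wait for all crawler processes to finish
    for proc in processes:
        proc.join()

Each process then runs its own pool of threads, and because every process reads from the same Redis-backed queue, the duplicate filtering and depth tracking remain consistent across all of them.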


